About the Data

This report explores a dataset of 4,898 white wines with 11 variables
quantifying the chemical properties of each wine. At least 3 wine experts
rated the quality of each wine, providing a rating between 0 (very bad) and
10 (very excellent).

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

We dropped the column X, which appears to be the row number. quality is an
ordered categorical variable with scores between 0 and 10. Interestingly, our
wine experts never gave the extreme scores of 0 (very bad), 1, 2, or 10 (very
excellent). The actual range is from 3 to 9, with a median of 6. The rest of
the variables are continuous, which makes sense since they represent the
amount of the corresponding substance in the wine, based on physicochemical
tests.
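
The range, mean, and median of quality can be recovered from the count table
alone. Below is a minimal sketch in plain Python (not the code used to produce
this report; the names quality_counts and median_from_counts are only for
illustration), using the counts printed above.

```python
# Quality score -> number of wines, copied from the count table above.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

total = sum(quality_counts.values())
mean_quality = sum(score * n for score, n in quality_counts.items()) / total


def median_from_counts(counts):
    """Median of a frequency table: walk the scores in ascending order
    until the cumulative count passes the middle position(s)."""
    n = sum(counts.values())
    middle = ((n - 1) // 2, n // 2)  # the two positions coincide when n is odd
    picked, seen = [], 0
    for score in sorted(counts):
        seen += counts[score]
        while len(picked) < 2 and seen > middle[len(picked)]:
            picked.append(score)
    return sum(picked) / 2


median_quality = median_from_counts(quality_counts)
print(total, min(quality_counts), max(quality_counts))
print(round(mean_quality, 3), median_quality)
```

The result should agree with the summary output for quality shown earlier.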

Univariate Plots Section

From this histogram of quality counts, we can see that the distribution is
roughly normal, with the mean (solid line, 5.88) and the median (dashed line,
6) at almost the same value. Most of the white wines have a quality of 6, and
5 is the second most common score.

## 
## FALSE  TRUE 
##  3838  1060
## 
## midiocre  premium 
##     3838     1060

I remember my wine teacher always talking about the Pareto principle (80/20
rule) in the wine industry (yes, I had a wine teacher). Wines of quality 7, 8,
and 9 make up 21.6% (1,060 of 4,898) of the white wines rated. Therefore, we
will consider wines of quality 7, 8, or 9 as premium.
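
The premium/mediocre split and the share of premium wines follow directly from
the quality counts. Here is a small Python sketch, assuming the threshold of 7
described above; wine_type and PREMIUM_THRESHOLD are hypothetical names, not
taken from the original code.

```python
# Quality score -> number of wines, copied from the count table above.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

PREMIUM_THRESHOLD = 7  # assumption: quality 7, 8, 9 count as premium


def wine_type(quality):
    """Label a wine by its quality score."""
    return "premium" if quality >= PREMIUM_THRESHOLD else "mediocre"


type_counts = {"mediocre": 0, "premium": 0}
for quality, n in quality_counts.items():
    type_counts[wine_type(quality)] += n

premium_share = type_counts["premium"] / sum(type_counts.values())
print(type_counts, f"{premium_share:.1%}")
```

The two counts should match the type table printed above.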

## 
## FALSE  TRUE 
##  4613   285

All wines contain sulphur dioxide in various forms, collectively known as
sulphites. Even in completely unsulphured wine it is present at concentrations
of up to 10 mg/L. Commercially made wines contain from ten to twenty times
that amount. (Source: morethanorganic)

Reasons why SO2 is not desirable in wine:

According to EU law, the maximum permitted level of SO2 in white/rosé wine
is 210 mg/L. As you can see in the first histogram, 285 wines exceed this
limit. We can also observe that all three histograms have a right-skewed
distribution. This might be due to the legal restriction on sulphites: most
vineyards obey the rules and avoid exceeding the limit.

## [1] 3.188267
## [1] 3.18

In this set of histograms, we explore the acidity of the wines. The first
three variables are the amounts of the corresponding acids found in the wines.
The fourth variable, pH, indicates the acidity level, where 7 is neutral and
the smaller the value, the more acidic the liquid. We observe a right-skewed
distribution for the first three and a roughly normal distribution for pH,
with median and mean at about 3.18 (acidic). It makes sense that the pH
histogram is not right-skewed like the other three, since the outliers in the
acidity histograms would have a lower pH value (the tail on the left of the pH
histogram).
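
One rough way to put a number on this difference in shape, using only the
summary output above, is to compare how far the maximum and the minimum lie
from the median. This tail ratio is a heuristic I am adding for illustration;
it is not part of the original analysis, and it depends on single extreme
values. A Python sketch:

```python
# (min, median, max) copied from the summary output above.
summary = {
    "fixed.acidity": (3.8, 6.8, 14.2),
    "volatile.acidity": (0.08, 0.26, 1.10),
    "citric.acid": (0.0, 0.32, 1.66),
    "pH": (2.72, 3.18, 3.82),
}


def tail_ratio(minimum, median, maximum):
    """Length of the upper tail relative to the lower tail.
    Values well above 1 suggest a longer right tail."""
    return (maximum - median) / (median - minimum)


ratios = {name: tail_ratio(*stats) for name, stats in summary.items()}
for name, ratio in ratios.items():
    print(f"{name:18s} {ratio:.2f}")
```

Under this measure the three acid variables have a noticeably longer upper
tail than pH, which is in line with what the histograms show.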

Some people believe that the sweeter a wine is, the more alcohol it should
contain. We cannot tell this just by looking at the histograms here. We will
look into it further in the bivariate plots section. Here we can see that both
residual sugar and salt (chlorides) have very right-skewed distributions. The
amount of salt is tiny for all white wines, with a maximum of 0.346 g/L. The
histogram of alcohol is slightly right-skewed with peaks at around 9 - 9.5%;
apart from the peaks it is fairly evenly distributed. Most wines have an
alcohol level of 8.5 - 12%.

We take residual sugar only and add the type feature to observe the
distributions for wines of both types. A log transformation is used here to
make the right-skewed distribution look closer to normal.
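
To illustrate what the log transformation does to residual sugar, here is a
short Python sketch. It assumes a base-10 logarithm (the report does not say
which base was used) and uses only the first ten values from the str() output
plus the minimum and maximum from the summary.

```python
import math

# First ten residual.sugar values (g/dm3) from the str() output above.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5]
log_sugar = [math.log10(x) for x in residual_sugar]  # assumption: base 10

# Full observed range from the summary output: 0.6 to 65.8 g/dm3.
raw_ratio = 65.8 / 0.6                                # max is ~110x the min
log_span = math.log10(65.8) - math.log10(0.6)         # about 2 units on a log10 axis

for raw, logged in zip(residual_sugar, log_sugar):
    print(f"{raw:5.1f} -> {logged:5.2f}")
print(f"raw max/min = {raw_ratio:.0f}, log10 span = {log_span:.2f}")
```

On the log scale the long right tail is compressed: equal distances on the
axis correspond to equal ratios of sugar rather than equal differences.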

We can see the shapes are similar overall, with most wines having between 1
and 20 g/dm3 of residual sugar. Mediocre wines have a distinctly bimodal
distribution with peaks at 1.7 and 8. The distribution of premium wines has
three peaks at 1.7, 5 and 13.5, but the differences between local minima and
maxima are not as extreme as those of mediocre wines.

Univariate Analysis

What is the structure of your dataset?

The white wine quality dataset consists of 4898 observations and 12 variables.
Each observation is a white variant of the Portuguese “Vinho Verde” wine.
Among the 12 variables, there are 11 input variables (numeric) which represent
the amount of corresponding substance existing in the wine based on
physicochemical tests. The output variable quality is based on sensory data
(median of at least 3 evaluations made by wine experts), and it is an ordered
categorical variable ranging from 0 (very bad) to 10 (very excellent).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. I am curious to know how the
amounts of the other factors affect the ratings from the wine experts.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

From reading the description of the dataset, I think volatile acidity, citric
acid, free sulfur dioxide, total sulfur dioxide, and density may support my
investigation, because they seem to affect the smell, taste and color. density
may contribute to the effect of “wine curtains”, which is also an essential
part of wine tasting.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new categorical variable type to indicate whether a wine
is premium or mediocre, where premium wines are the ones rated 7 or above in
quality and mediocre wines are the rest.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I dropped the column X, which appears to be the row number. I also changed
quality to an ordered categorical variable with scores between 0 and 10.

Many variables have a right-skewed distribution with outliers at the far end
of the tail. However, quality is quite normally distributed, with no extreme
values like 0, 1, 2, or 10. I haven’t removed the “outliers” from the dataset
because at this point I am not sure whether their extremeness contributes to
the feature of interest.
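
The report does not define “outlier” formally. One common convention, used by
default in most boxplots, flags values more than 1.5 times the interquartile
range beyond the quartiles. Assuming that rule, the following Python sketch
checks from the summary output alone whether a variable has any outliers on
the upper side; upper_fence is a hypothetical helper name.

```python
# (1st quartile, 3rd quartile, max) copied from the summary output above.
quartiles = {
    "residual.sugar": (1.7, 9.9, 65.8),
    "chlorides": (0.036, 0.050, 0.346),
    "free.sulfur.dioxide": (23.0, 46.0, 289.0),
    "alcohol": (9.5, 11.4, 14.2),
}


def upper_fence(q1, q3, k=1.5):
    """Upper boxplot fence: Q3 + k * IQR (k = 1.5 is the usual convention)."""
    return q3 + k * (q3 - q1)


has_upper_outliers = {
    name: maximum > upper_fence(q1, q3)
    for name, (q1, q3, maximum) in quartiles.items()
}
for name, flagged in has_upper_outliers.items():
    print(f"{name:20s} max beyond upper fence: {flagged}")
```

By this rule the maximum of alcohol stays inside the fence, while the maxima
of the three skewed variables lie far beyond it.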

Bivariate Plots Section

This scatterplot matrix with all variables gives a broad overview of which
variables might be interesting.

This correlation plot shows the correlations between variables more clearly.

An interesting observation is that outliers occur more often in the
middle-range qualities (5, 6, 7) than at the extremes. Very few outliers can
be observed for 9-quality or 3-quality wines, though these groups are also
much smaller (5 and 20 wines).

If you look at the boxplot at quality 9 for each factor, notice that the “box”
is generally smaller than for other qualities (especially for density and
sulfur.dioxide). This suggests that there is a specific set of
characteristics needed to be rated as a “very excellent” quality Portuguese
“Vinho Verde” white wine. At this point, I’m impressed by the wine experts who
rated these wines. Just by blind tasting, they can detect the excellent wines
with exactly the right amount of each substance.

I really like this boxplot of alcohol. To reach a quality of 9, the alcohol
level has to be close to 12.5%. However, within each of the other quality
levels, the alcohol level can vary by as much as 6 percentage points. Overall,
from quality 5 upward there is a trend: the more alcohol, the better the wine.

Looking at this scatterplot of residual.sugar and density, we can spot a
positive correlation between the two variables. An outlier can also be
spotted. It seems to be a valid data point, since it still respects the
density/residual.sugar correlation; it just has an extremely high
residual.sugar level. It must be a really sweet wine. We will eliminate this
outlier from our plot and subset our data to residual.sugar less than 30
g/dm3.

Something interesting is happening here. Overall, the linear smooth line fits
the scatterplot well. However, the lower the residual.sugar level, the wider
the range of density at that level. My guess is that when the residual.sugar
level is low, density is correlated with a third variable.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

From the set of boxplots, we can observe that alcohol seems to be
appreciated: with a higher alcohol level, the median quality rating is
generally higher.

pH, fixed.acidity and citric.acid show a slight positive correlation as well.

On the other hand, sulfur dioxide, sugar, and density are not appreciated;
they are negatively correlated with quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

We can observe a strong correlation (0.853) between density and
residual.sugar, which is what I suspected before.
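
For reference, the correlation quoted here is the Pearson coefficient, which
can be computed by hand as sketched below in Python. The sketch uses only the
first five rows visible in the str() output (density is shown rounded there),
so it illustrates the calculation and does not reproduce the figure for the
full dataset; pearson_r is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the
    standard deviations (the 1/n factors cancel out)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five rows from the str() output above (illustration only).
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5]
density = [1.001, 0.994, 0.995, 0.996, 0.996]

r = pearson_r(residual_sugar, density)
print(f"r on the first five rows: {r:.3f}")
```

Even on these few rows the sign is positive: the sweetest wine in the sample
is also the densest.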

It’s only natural to see that free.sulfur.dioxide and total.sulfur.dioxide
have a correlation of 0.61.

Also as suspected before, sugar and density have a strong correlation
of 0.839. All the other factors contribute somewhat to density, as the
correlation of density with the other factors ranges from 0.15 to 0.839,
except for volatile acidity (corr: 0.0271).

Surprisingly, alcohol and residual.sugar have a negative correlation
of -0.427. alcohol and density also have a strong negative correlation of
-0.711, which makes sense since density and residual.sugar are highly
positively correlated.

From the boxplots on the quality column, we suspect that alcohol,
total.sulfur.dioxide, and density have some effects on the ratings of
wine quality by the wine experts.

What was the strongest relationship you found?

The strongest relationship I found is between residual.sugar and density.
They have a correlation of 0.853. density and alcohol also have a strong
negative correlation of -0.78.

Multivariate Plots Section

## [1]   9 440

Following the previous analysis of residual.sugar and density,
total.sulfur.dioxide is added as a feature here. I first cut the variable
into 4 buckets: (0, 100], (100, 150], (150, 210], and (210, 440]. My earlier
guess was correct. At the same residual.sugar level, wines with a lower
total.sulfur.dioxide level are less dense.
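
The bucketing step can be sketched as follows in Python. This is an
illustration rather than the original code: so2_bucket is a hypothetical
helper, and the intervals are taken to be closed on the right, as the bucket
notation above suggests.

```python
from bisect import bisect_left

# Inner break points and labels for the four right-closed buckets.
BREAKS = [100, 150, 210]
LABELS = ["(0,100]", "(100,150]", "(150,210]", "(210,440]"]


def so2_bucket(value):
    """Return the bucket label for a total.sulfur.dioxide value.
    bisect_left puts a value equal to a break into the lower bucket,
    which is what a right-closed interval requires."""
    return LABELS[bisect_left(BREAKS, value)]


# First ten total.sulfur.dioxide values from the str() output above.
total_so2 = [170, 132, 97, 186, 186, 97, 136, 170, 132, 129]
buckets = [so2_bucket(v) for v in total_so2]

for value, bucket in zip(total_so2, buckets):
    print(f"{value:4d} -> {bucket}")
```

The break at 210 coincides with the EU limit mentioned earlier, so the last
bucket holds exactly the wines above that limit.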

To make this set of plots, outliers (residual.sugar > 30) are removed from
the dataset.

We can see that the strong correlation between density and sugar holds at
every quality level.

Looking at the second plot, we can see that at the same level of sugar,
premium wines are less dense than mediocre wines. Mediocre wines also have a
bigger range of residual.sugar levels (the outliers we didn’t show are also
mediocre wines).

total.sulfur.dioxide and density are not as strongly correlated as sugar and
density, but we can observe the same trend: the line of fit for premium is
lower than for mediocre.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

sugar and density seem to strengthen each other in terms of looking at
quality. At the same sugar level, premium wines tend to be less dense than
mediocre wines. Wines with an extremely high sugar level have a lower chance
of being rated as excellent.

Were there any interesting or surprising interactions between features?

Wines at quality levels 5, 6, and 7 tend to have the most extreme levels of
features like residual.sugar and sulfur.dioxide. This is surprising, as they
are rated not as bad wines (levels 2 and 3) but as OK wines.


Final Plots and Summary

Plot One

Description One

This is a histogram of the quality counts of the wines. The dashed line is the
median and the solid line is the mean. We can see that the distribution is
roughly normal, with the mean (5.88) and median (6) close together.

Plot Two

Description Two

This plot has 3 dimensions: residual.sugar, density, and
total.sulfur.dioxide.bucket (cut from total.sulfur.dioxide). We can observe
the positive correlation between residual.sugar and density. At the same
residual.sugar level, wines with a lower total.sulfur.dioxide level are less
dense.

Plot Three

Description Three

From this plot, we can see that at the same level of sugar, premium wines are
less dense than mediocre wines. Mediocre wines also have a bigger range of
residual.sugar levels (the outliers we didn’t show here are also mediocre
wines).


Reflection

At the beginning it was hard to understand what each numeric variable means
and how it could affect the quality of wine. After doing some research and
reading the documentation of the dataset more carefully, it became clearer how
I could explore this dataset. Another struggle is that the differences in the
values of the variables are really subtle: you can see from the scatterplots
that the points all cluster together, so it’s hard to distinguish them when
you just map quality to color in the same scatterplot. Maybe some
transformation of the data could be used in the future to make it possible to
visually separate the clusters.